Here are a few tools you can use when you are working with data!
A few shoutouts:
Notebooks and R Markdown documents are great ways to annotate and write as you code. This document is a perfect example: we can write text to produce a document and run code in chunks like the one below:
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.4
## ✓ tibble 3.0.1 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(gapminder)
R scripts are the original way of coding in R. They don’t have the nice functionality of writing notes while you code, but if you don’t need notes or writing, just put all your code in an R script and you can run the whole thing in a single go pretty easily.
Wait! What are packages?
An R package is a unit of shareable code. * more info here
This means people can download an R package and use the code written in it. Of course, you could build all your own functions if you wanted to, but if others have already done so, and their code has been peer reviewed and tried and tested, why not use it?
Don’t reinvent the wheel!
Honestly… this is all you need. But if you want more…
There are tons…
Install a package first; you only have to do this once per R installation. Then use the library() function to load the package’s code in each new session.
# install.packages("tidyverse")
library(tidyverse)
# install.packages("gapminder")
library(gapminder)
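The install-once, load-every-session pattern can be sketched as a conditional install. This is only a minimal sketch and not part of the original walkthrough: `missing_packages()` is a hypothetical helper name, and the `install.packages()` call is left commented out so nothing gets downloaded unless you opt in.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "gapminder")
to_install <- missing_packages(needed)
to_install

# Install only what is missing (uncomment to actually install).
# if (length(to_install) > 0) install.packages(to_install)

# Loading still has to happen in every new session:
# library(tidyverse)
# library(gapminder)
```

Checking first means a script can be re-run on the same machine without reinstalling anything.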
What is Gapminder?
Free open dataset from Gapminder (an organization). * more info here
You need to install the gapminder package and load it before you can see the data.
FYI: other open datasets here.
There are a few tools for you to use…
Gapminder is already loaded in through the package… so we’re all good here.
But if you want to import your own data, you can use read_csv(). If you put a question mark in front of the function name and run it, you’ll get all the info you need from its help page. You can also use the Import Dataset button in the Environment pane at the top right of RStudio; you’ll get tons of options there.
?read_csv
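As a rough illustration of read_csv() in action, the sketch below writes a tiny CSV to a temporary file and reads it back. The file and the object name `my_data` are made up for the example; the three rows are copied from the gapminder output shown in this document. With your own data you would pass the path to your file instead.

```r
library(readr)

# Write a tiny CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "country,year,lifeExp",
    "Afghanistan,1952,28.801",
    "Afghanistan,1957,30.332",
    "Afghanistan,1962,31.997"
  ),
  csv_path
)

# Declaring column types up front avoids surprises from type guessing.
my_data <- read_csv(
  csv_path,
  col_types = cols(
    country = col_character(),
    year    = col_integer(),
    lifeExp = col_double()
  )
)

my_data
```

Spelling out `col_types` is optional; if you leave it out, read_csv() guesses the types and tells you what it picked.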
First thing we want to do is get familiar with our data…
View entire dataset
view(gapminder)
View summaries of dataset
str(gapminder)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
Use the head() function to view the first 6 rows (the default)
head(gapminder)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
You can make this a whole lot prettier with the kable() and kable_styling() functions from the knitr and kableExtra packages.
library(knitr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
gapminder %>%
head() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
| Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |
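The table above still shows raw variable names and a lot of decimal places. A minimal sketch of tidying that with kable()'s own `digits`, `col.names`, and `caption` arguments follows. The data frame `first_rows` is typed in by hand from the rows above so the sketch does not depend on any other package, and the column labels and caption are just suggestions.

```r
library(knitr)

# First three rows of the table above, typed in by hand.
first_rows <- data.frame(
  country   = c("Afghanistan", "Afghanistan", "Afghanistan"),
  year      = c(1952L, 1957L, 1962L),
  lifeExp   = c(28.801, 30.332, 31.997),
  gdpPercap = c(779.4453, 820.8530, 853.1007)
)

# digits rounds numeric columns; col.names replaces the raw variable names.
tidy_table <- kable(
  first_rows,
  digits    = 1,
  col.names = c("Country", "Year", "Life expectancy", "GDP per capita"),
  caption   = "First rows of gapminder (rounded)"
)

tidy_table
```

The result can still be piped into kable_styling() in an HTML document, exactly as in the chunk above.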
Basic statistical summaries of variables
summary(gapminder)
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
Ok we know our data!
But wait, one quick change first: convert the country and continent columns from factors to character strings… trust me.
gapminder <-
gapminder %>%
mutate(country = as.character(country),
continent = as.character(continent))
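The walkthrough does not say why this conversion helps. One plausible reason, offered here as an assumption rather than the author's stated motive, is that a factor remembers all of its levels even after rows are filtered out, which can clutter legends and summaries. A minimal sketch on made-up toy data (the life expectancy values are invented):

```r
library(dplyr)

# Toy data with a factor column, standing in for gapminder's country column.
toy <- data.frame(
  country = factor(c("Chile", "Kenya", "Japan")),
  lifeExp = c(70, 60, 80)
)

# Filtering a factor keeps the unused level around...
kept <- toy %>% filter(country != "Kenya")
levels(kept$country)      # still lists all three countries

# ...whereas a character column only carries the values that are present.
toy_chr  <- toy %>% mutate(country = as.character(country))
kept_chr <- toy_chr %>% filter(country != "Kenya")
unique(kept_chr$country)  # only the two remaining countries
```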
Does GDP impact life expectancy? … others?
summary(gapminder$lifeExp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23.60 48.20 60.71 59.47 70.85 82.60
summary(gapminder$gdpPercap)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 241.2 1202.1 3531.8 7215.3 9325.5 113523.1
Plot the distribution of each variable on its own (univariate) and then the two together (bivariate). Start with life expectancy:
gapminder %>%
ggplot(., aes(x = lifeExp)) +
geom_histogram() +
labs(
title = "Distribution of life expectancy",
subtitle = "All the data.. no filters",
x = "Life expectancy",
y = "Frequency",
caption = "Data source: Gapminder from gapminder package"
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
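That message is ggplot2 nudging us to choose the bin size ourselves. A minimal sketch, assuming a bin width of 2 years is sensible for life expectancy (the value is a judgment call, not something the data dictates):

```r
library(ggplot2)
library(gapminder)

# binwidth is in the units of the x variable: here, 2 years per bar.
p_binwidth <- ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(binwidth = 2) +
  labs(x = "Life expectancy", y = "Frequency")

p_binwidth
```

Setting `binwidth` (or `bins`) explicitly also makes the message go away.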
Ok but maybe we want to filter on year or look across years?? so we see all the countries at one point in time?
Let’s look across time.
gapminder %>%
ggplot(., aes(x = lifeExp)) +
geom_histogram() +
labs(
title = "Distribution of life expectancy",
subtitle = "by year",
x = "Life expectancy",
y = "Frequency",
caption = "Data source: Gapminder from gapminder package"
) +
facet_wrap(~year)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Let’s look at one year.. say 2007
gapminder %>%
filter(year == 2007) %>%
ggplot(., aes(x = lifeExp)) +
geom_histogram() +
labs(
title = "Distribution of life expectancy",
subtitle = "in 2007",
x = "Life expectancy",
y = "Frequency",
caption = "Data source: Gapminder from gapminder package"
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
… I don’t like these graphs as much yet, so let’s make them a bit prettier with the ggthemes and hrbrthemes packages.
library(ggthemes)
library(hrbrthemes)
hrbrthemes::import_roboto_condensed()
## You will likely need to install these fonts on your system as well.
##
## You can find them in [/Library/Frameworks/R.framework/Versions/4.0/Resources/library/hrbrthemes/fonts/roboto-condensed]
theme_set(theme_bw())
#or
theme_set(theme_ipsum())
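Note that only the most recent theme_set() call is in effect, so the two calls above are alternatives rather than a combination. The sketch below shows how the default theme can be swapped and restored. It uses theme_minimal() and the built-in mtcars data as stand-ins (both are assumptions made so the example needs nothing beyond ggplot2).

```r
library(ggplot2)

# theme_set() returns the previously active theme, so it can be restored later.
old_theme <- theme_set(theme_bw())

# Calling theme_set() again replaces the default; only the latest call counts.
theme_set(theme_minimal())

# A theme added to a single plot overrides the default for that plot only.
p_themed <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2) +
  theme_classic()

p_themed

# Put things back the way they were.
theme_set(old_theme)
```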
We can also make these interactive with plotly or ggiraph
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
p <- gapminder %>%
filter(year == 2007) %>%
ggplot(., aes(x = lifeExp)) +
geom_histogram(colour = "black", size = 0.25) +
labs(
title = "Distribution of life expectancy",
subtitle = "in 2007",
x = "Life expectancy",
y = "Frequency",
caption = "Data source: Gapminder from gapminder package"
)
ggplotly(p)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now plot the distribution of the second variable, GDP per capita (univariate)
gapminder %>%
ggplot(., aes(x = gdpPercap)) +
geom_histogram(colour = "black", size = 0.25) +
labs(
title = "Distribution of GDP per capita",
subtitle = "All the data.. no filters",
x = "GDP per capita",
y = "Frequency",
caption = "Data source: Gapminder from gapminder package"
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Ok but maybe we want to filter on year or look across years?? so we see all the countries at one point in time?
Let’s look across time.
gapminder %>%
ggplot(., aes(x = gdpPercap)) +
geom_histogram(colour = "black", size = 0.25) +
labs(
title = "Distribution of GDP per capita",
subtitle = "by year",
x = "GDP per capita",
y = "Frequency",
caption = "Data source: Gapminder from gapminder package"
) +
facet_wrap(~year)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Let’s look at one year.. say 2007
gapminder %>%
filter(year == 2007) %>%
ggplot(., aes(x = gdpPercap)) +
geom_histogram(colour = "black", size = 0.25) +
labs(
title = "Distribution of GDP per capita",
subtitle = "in 2007",
x = "GDP per capita",
y = "Frequency",
caption = "Data source: Gapminder from gapminder package"
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p <- gapminder %>%
filter(year == 2007) %>%
ggplot(., aes(x = gdpPercap)) +
geom_histogram(colour = "black", size = 0.25) +
labs(
title = "Distribution of GDP per capita",
subtitle = "in 2007",
x = "GDP per capita",
y = "Frequency",
caption = "Data source: Gapminder from gapminder package"
)
ggplotly(p)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
gapminder %>%
filter(year == 2007) %>%
ggplot(., aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
labs(
title = "Life expectancy vs. GDP per capita",
subtitle = "in 2007",
x = "GDP per capita",
y = "Life expectancy (years)",
caption = "Data source: Gapminder from gapminder package"
)
How about adding country as another data element?
library(viridisLite)
gapminder %>%
filter(year == 2007) %>%
ggplot(., aes(x = gdpPercap, y = lifeExp, colour = as.character(country))) +
geom_point() +
labs(
title = "Life expectancy vs. GDP per capita",
subtitle = "in 2007",
x = "GDP per capita",
y = "Life expectancy (years)",
caption = "Data source: Gapminder from gapminder package"
) +
scale_colour_viridis_d()
Ok, the legend doesn’t work well with the graph.. too many countries! maybe take out the legend for now.
gapminder %>%
filter(year == 2007) %>%
ggplot(., aes(x = gdpPercap, y = lifeExp, colour = country)) +
geom_point() +
labs(
title = "Life expectancy vs. GDP per capita",
subtitle = "in 2007",
x = "GDP per capita",
y = "Life expectancy (years)",
caption = "Data source: Gapminder from gapminder package"
) +
scale_colour_viridis_d() +
theme(
legend.position = "none"
)
That’s better.
Interactive? Yes please. You’ll see we can actually add the legend back in with the interactivity.
p <- gapminder %>%
filter(year == 2007) %>%
ggplot(., aes(x = gdpPercap, y = lifeExp, colour = country)) +
geom_point() +
labs(
title = "Life expectancy vs. GDP per capita",
subtitle = "in 2007",
x = "GDP per capita",
y = "Life expectancy (years)",
colour = "Country",
caption = "Data source: Gapminder from gapminder package"
) +
scale_colour_viridis_d()
ggplotly(p)
See here for transformations: http://www.sthda.com/english/wiki/ggplot2-axis-scales-and-transformations#log-and-sqrt-transformations
Our data doesn’t look linear… so transform it? Use scale_x_log10() to put the x-axis on a log scale.
gapminder %>%
filter(year == 2007) %>%
ggplot(., aes(x = gdpPercap, y = lifeExp, colour = country)) +
geom_point() +
geom_smooth(aes(group = 1), lty = 2, colour = "grey80", se = FALSE) +
labs(
title = "Life expectancy vs. GDP per capita",
subtitle = "in 2007",
x = "GDP per capita (log scale)",
y = "Life expectancy (years)",
colour = "Country",
caption = "Data source: Gapminder from gapminder package"
) +
scale_colour_viridis_d() +
theme_bw() +
theme(
legend.position = "none"
) +
scale_x_log10()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Or create a new column in your data using mutate()!
gapminder2 <-
gapminder %>%
mutate(loggdp = log10(gdpPercap))
gapminder2 %>%
head()
## # A tibble: 6 x 7
## country continent year lifeExp pop gdpPercap loggdp
## <chr> <chr> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 2.89
## 2 Afghanistan Asia 1957 30.3 9240934 821. 2.91
## 3 Afghanistan Asia 1962 32.0 10267083 853. 2.93
## 4 Afghanistan Asia 1967 34.0 11537966 836. 2.92
## 5 Afghanistan Asia 1972 36.1 13079460 740. 2.87
## 6 Afghanistan Asia 1977 38.4 14880372 786. 2.90
Plot that
p <-
gapminder2 %>%
filter(year == 2007) %>%
ggplot(., aes(x = loggdp, y = lifeExp, colour = country)) +
geom_smooth(aes(group = 1), lty = 2, colour = "grey80", se = FALSE) +
geom_point() +
labs(
title = "Life expectancy vs. GDP per capita",
subtitle = "in 2007",
x = "Log GDP per capita",
y = "Life expectancy (years)",
colour = "Country",
caption = "Data source: Gapminder from gapminder package"
) +
scale_colour_viridis_d()
ggplotly(p)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Linear model fit
gapminder2 %>%
filter(year == 2007) %>%
lm(data = ., lifeExp ~ gdpPercap)
##
## Call:
## lm(formula = lifeExp ~ gdpPercap, data = .)
##
## Coefficients:
## (Intercept) gdpPercap
## 5.957e+01 6.371e-04
#don't forget to log transform
gapminder2 %>%
filter(year == 2007) %>%
lm(data = ., lifeExp ~ log10(gdpPercap))
##
## Call:
## lm(formula = lifeExp ~ log10(gdpPercap), data = .)
##
## Coefficients:
## (Intercept) log10(gdpPercap)
## 4.95 16.59
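To read the log-transformed fit: because the predictor is log10(gdpPercap), the slope is the change in life expectancy associated with a tenfold increase in GDP per capita (an association in this 2007 cross-section, not proof of cause). A minimal sketch that checks this with predict(); the object names and the two GDP values (1,000 and 10,000) are made up for illustration:

```r
library(gapminder)

gap_2007 <- subset(gapminder, year == 2007)
fit_log  <- lm(lifeExp ~ log10(gdpPercap), data = gap_2007)

# The slope is the expected change in life expectancy (years)
# for a tenfold increase in GDP per capita.
slope <- unname(coef(fit_log)["log10(gdpPercap)"])
slope

# The same thing via predict(): compare two hypothetical countries
# whose GDP per capita differs by a factor of ten.
preds <- predict(fit_log, newdata = data.frame(gdpPercap = c(1000, 10000)))
unname(diff(preds))  # equals the slope
```

For more detail (standard errors, R-squared), pass the fitted model to summary().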
Add the linear regression line to the plot
gapminder2 %>%
filter(year == 2007) %>%
ggplot(., aes(x = loggdp, y = lifeExp, colour = country)) +
geom_smooth(method = "lm", aes(group = 1), lty = 2, colour = "grey80", se = FALSE) +
geom_point() +
labs(
title = "Life expectancy vs. GDP per capita",
subtitle = "in 2007",
x = "Log GDP per capita",
y = "Life expectancy (years)",
colour = "Country",
caption = "Data source: Gapminder from gapminder package"
) +
scale_colour_viridis_d() +
theme(
legend.position = "none"
)
## `geom_smooth()` using formula 'y ~ x'